Abstract

We consider the power to reject false values of the parameter in Frequentist 
methods for the calculation of confidence intervals. We connect the power with 
the physical significance (reliability) of confidence intervals for a parameter 
bounded to be non-negative. We show that the confidence intervals (upper 
limits) obtained with a (biased) method that near the boundary has large 
power in testing the parameter against larger alternatives and small power in 
testing the parameter against smaller alternatives are physically more signif- 
icant. Considering the recently proposed methods with correct coverage, we 
show that the physical significance of upper limits is smallest in the Unified 
Approach and highest in the Maximum Likelihood Estimator method. We 
illustrate our arguments in the specific cases of a bounded Gaussian distribu- 
tion and a Poisson distribution with known background. 

PACS numbers: 06.20.Dk 
I. INTRODUCTION

The problem of calculating Frequentist confidence intervals and upper limits has recently
received new contributions [?]-[11] and has been the subject of intense discussions (see
Refs. [?]-[14]). Three new methods with correct coverage (see Section II) have been
proposed: Unified Approach [?], Bayesian Ordering [?] and Maximum Likelihood Estimator [?]^1.

In this paper we consider the power of Frequentist methods to reject false values of the 
parameter under investigation and we connect it with the physical significance of confidence 
intervals. In this context the physical significance of a confidence interval is its degree of
reliability. For example, an unbelievably small upper limit below the sensitivity of the 
experiment has negligible physical significance (one could even argue that it has "negative" 
physical significance, since it gives misleading information to those who believe in it). Empty 
confidence intervals are practically useless and physically insignificant. 

In Section II we review the coverage of Frequentist methods and their power to reject
false values of the parameter. Section III constitutes the main part of the paper, in which,
considering a non-negative parameter, we discuss the power of confidence intervals and
their physical significance. In Sections IV and V we illustrate the arguments presented
in Section III using as examples a Gaussian distribution with mean $\mu \geq 0$ and a Poisson
distribution with known background, respectively.



II. COVERAGE AND POWER 

An important property of Frequentist confidence intervals is coverage. A method for
the calculation of confidence intervals has correct coverage if its confidence intervals with
$100(1-\alpha)\%$ confidence level (CL) belong to a set of confidence intervals that can be obtained
with a large ensemble of experiments, $100(1-\alpha)\%$ of which contain the true value of the
parameter (see, for example, Refs. [13]-[17], [?]). In other words, if coverage is satisfied, a
$100(1-\alpha)\%$ CL confidence interval has a probability $1-\alpha$ to cover the true value of the
parameter.

Confidence intervals with correct coverage can be calculated using Neyman's method.
In this method, $100(1-\alpha)\%$ CL confidence intervals for a quantity $\mu$ are obtained through
the construction of a confidence belt in the $(\hat{\mu}, \mu)$ plane, where $\hat{\mu}$ is an appropriate
estimator of $\mu$. For each possible value of $\mu$ one calculates an acceptance interval of the
estimator with integral probability $1-\alpha$. The union of all the acceptance intervals constitutes
the confidence belt. The confidence interval resulting from a measurement of $\hat{\mu}$ is given by
all the values of $\mu$ whose acceptance interval includes the measured value of $\hat{\mu}$ (see, for
example, Refs. [?]).
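As an illustration of the belt construction just described, the following is a minimal sketch for a Gaussian estimator with $\sigma = 1$ and a parameter restricted to $\mu \geq 0$, using central (equal-tail) acceptance intervals. The function names, the grid scan over $\mu$ and the upper scan limit `mu_max` are assumptions made only for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian (mean 0, sigma 1)


def central_acceptance(mu, z, sigma=1.0):
    """Central acceptance interval of the estimator x for a given mu."""
    return mu - z * sigma, mu + z * sigma


def confidence_interval(x, alpha=0.10, sigma=1.0, mu_max=10.0, step=0.001):
    """Scan mu >= 0 and keep every value whose acceptance interval contains x."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # about 1.645 for alpha = 0.10
    accepted = []
    for i in range(int(round(mu_max / step)) + 1):
        mu = i * step
        lo, hi = central_acceptance(mu, z, sigma)
        if lo <= x <= hi:
            accepted.append(mu)
    # An empty list means that no physical value of mu is accepted.
    return (accepted[0], accepted[-1]) if accepted else None


print(confidence_interval(2.0))   # a two-sided interval
print(confidence_interval(-2.0))  # None: the interval is empty
```

With this construction a measurement far below the boundary leaves no accepted value of $\mu$, which is the empty-interval behaviour of central intervals near a physical boundary.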

^1 We introduce here this name for the method proposed by Mandelkern and Schultz in Ref. [?],
since it is based on the use of the estimator derived from maximum likelihood.

Coverage, however, is not the only quantity that is important in the construction of
confidence intervals. Another quantity, called power, is related to the probability to reject
false values of the parameter. Coverage and power are connected, respectively, with the
so-called Type I and Type II errors in testing a simple statistical hypothesis $H_0$ against a
simple alternative hypothesis $H_1$ (see Ref. [16], section 20.9):



Type I error: Reject the null hypothesis $H_0$ when it is true. The probability of a Type I
error is called the size of the test and is usually denoted by $\alpha$.

Type II error: Accept the null hypothesis $H_0$ when the alternative hypothesis $H_1$ is true.
The probability of a Type II error is usually denoted by $\beta$. The power of a test is the
probability $\pi = 1 - \beta$ to reject $H_0$ if $H_1$ is true. A test is most powerful if its power is
the largest among all possible tests.

In Neyman's method, whatever the true value of $\mu$ is, the probability that it is not
included in a $100(1-\alpha)\%$ CL confidence interval is $\alpha$. From the point of view of hypothesis
testing, if one considers a possible value $\mu_0$ of $\mu$ as a null hypothesis, $H_0$: $\mu = \mu_0$, the
probability to reject $\mu_0$ if it is true is $\alpha$, i.e. there is a probability $\alpha$ to make a Type I
error. In other words, the acceptance interval of the estimator $\hat{\mu}$ corresponding to $\mu_0$ is the
acceptance region of the test and the complementary interval is the critical region of the
test. If the measured value of $\hat{\mu}$ falls in the critical region, the null hypothesis is rejected.
This happens with a probability $\alpha$ if the null hypothesis is true, as required by coverage.

The property of coverage is not sufficient to specify uniquely how to construct the
confidence belt. Different Frequentist methods with correct coverage follow different prescriptions
for the definition of the acceptance intervals. The associated probability $\alpha$ of a Type I error
is the same, but the probability $\beta$ of a Type II error and the corresponding power $\pi = 1 - \beta$
are different.
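The relation between an acceptance interval, the size $\alpha$ and the power $\pi = 1 - \beta$ can be made concrete with a short sketch. It assumes a Gaussian estimator with $\sigma = 1$ and a central acceptance interval; the function `power` and the values $\mu_0 = 3$, $\mu_1 = 5$ are hypothetical choices for illustration only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(lo, hi, mu1, sigma=1.0):
    """Probability that x ~ N(mu1, sigma) falls outside the acceptance interval [lo, hi]."""
    accept = STD_NORMAL.cdf((hi - mu1) / sigma) - STD_NORMAL.cdf((lo - mu1) / sigma)
    return 1.0 - accept


alpha = 0.10
z = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
mu0 = 3.0                     # null hypothesis H0: mu = mu0
lo, hi = mu0 - z, mu0 + z     # central acceptance interval for mu0

size = power(lo, hi, mu0)         # Type I error probability, equal to alpha
beta = 1.0 - power(lo, hi, 5.0)   # Type II error probability if mu1 = 5 is true
print(size, beta, 1.0 - beta)
```

Changing `lo` and `hi` while keeping the integral probability fixed at $1 - \alpha$ leaves `size` unchanged but modifies `beta`, which is the freedom exploited by the different methods.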

Unfortunately, the power associated with a confidence belt is not easy to evaluate,
because for each possible value $\mu_0$ of $\mu$ considered as a null hypothesis there is no simple
alternative hypothesis that allows one to calculate the probability $\beta$ of a Type II error. Instead,
we have the alternative hypothesis $H_1$: $\mu_1 \neq \mu_0$, which is composite. For each value of
$\mu_1 \neq \mu_0$ one can calculate the probability $\beta_{\mu_0}(\mu_1)$ of a Type II error associated with a given
acceptance interval corresponding to $\mu_0$. A method that gives an acceptance region for $\mu_0$
which has the largest possible power $\pi_{\mu_0}(\mu_1) = 1 - \beta_{\mu_0}(\mu_1)$ if $\mu_1$ is true is most powerful
with respect to the alternative $\mu_1$. Clearly, it would be desirable to find a uniformly most
powerful test, i.e. a test that gives an acceptance region for $\mu_0$ which has the largest possible
power $\pi_{\mu_0}(\mu_1)$ for any value of $\mu_1$. Unfortunately, the Neyman-Pearson lemma implies that
in general a uniformly most powerful test does not exist if the alternative hypothesis is
two-sided, i.e. both $\mu_1 < \mu_0$ and $\mu_1 > \mu_0$ are possible, and the derivative of the Likelihood with
respect to $\mu$ is continuous in $\mu_0$ (see Ref. [16], section 20.18). Nevertheless, it is possible to
find a uniformly most powerful test if the class of tests is restricted in appropriate ways. A
class of tests that has some merit is the class of unbiased tests, such that

$$\pi_{\mu_0}(\mu_1) \geq \alpha , \qquad (2.1)$$

i.e. the probability of rejecting $\mu_0$ when it is false is at least as large as the probability of
rejecting $\mu_0$ when it is true. The equal-tail test used in the Central Intervals method is
unbiased and uniformly most powerful unbiased for distributions belonging to the exponential
family, such as, for example, the Gaussian and Poisson distributions (see Ref. [16], section
21.31).

Therefore, the Central Intervals method is widely used because it corresponds to a
uniformly most powerful unbiased test. Other methods based on asymmetric tests unavoidably
introduce some bias.
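For the Gaussian case the unbiasedness of the equal-tail test can be checked directly. The short derivation below is a sketch under the assumptions of known $\sigma$ and no physical boundary; the symbols $\delta$, $z$, $\Phi$ and $\varphi$ (shift in units of $\sigma$, Gaussian quantile, standard Gaussian cumulative distribution and density) are introduced here only for this illustration.

```latex
% Central acceptance interval for \mu_0: |x - \mu_0| \le z\sigma, with \Phi(z) = 1 - \alpha/2.
% Let \delta = (\mu_1 - \mu_0)/\sigma. The probability to reject \mu_0 when \mu_1 is true is
\pi_{\mu_0}(\mu_1) = \Phi(-z - \delta) + \Phi(-z + \delta) .
% Its derivative with respect to \delta is
\frac{d\pi_{\mu_0}}{d\delta} = \varphi(z - \delta) - \varphi(z + \delta) ,
% which is positive for \delta > 0 and negative for \delta < 0, because |z - \delta| < |z + \delta|
% when \delta > 0 (and the reverse when \delta < 0). Hence the power is minimal at \delta = 0:
\pi_{\mu_0}(\mu_1) \ge \pi_{\mu_0}(\mu_0) = 2\,\Phi(-z) = \alpha .
```

Any shift of the acceptance interval at fixed integral probability breaks the symmetry in $\delta$, so the power falls below $\alpha$ on one side of $\mu_0$.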

III. POWER NEAR A BOUNDARY 

In some cases the Central Intervals method is not satisfactory. The two cases which occur
often in physics are the Gaussian distribution with a mean $\mu$ physically bounded to be
non-negative^2 and the Poisson distribution with mean $\mu \geq 0$ and a known background. In these
cases the Central Intervals method sometimes produces empty confidence intervals, which are
physically useless. The recently proposed Frequentist methods with correct coverage [?], [?], [?]
cure this problem by considering appropriate constructions of the confidence belt that guarantee
a transition from two-sided confidence intervals to upper limits near the boundary for the
Gaussian distribution and for a small number of counts in the case of the Poisson distribution.
From the discussion at the end of Section II, it is clear that these methods are biased. In
particular, since the acceptance intervals shift towards lower values of the estimator, which
are more likely for smaller values of $\mu$, these methods have a bias in testing the alternatives
$\mu_1 < \mu_0$. On the other hand, these methods are more powerful than the Central Intervals
method in the test of the alternatives $\mu_1 > \mu_0$.
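The effect of shifting an acceptance interval towards lower values of the estimator can be seen in a small numerical sketch. It compares a central interval with the limiting case of an interval shifted fully to the left, for a Gaussian estimator with $\sigma = 1$. The value $\mu_0 = 0.5$, the alternatives $\mu_1 = 1.5$ and $\mu_1 = 0$, and the fully shifted interval are assumptions for illustration and do not reproduce any of the specific methods discussed in the text.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.10


def power(interval, mu1):
    """Probability to reject when mu1 is true (Gaussian estimator, sigma = 1)."""
    lo, hi = interval
    return 1.0 - (STD_NORMAL.cdf(hi - mu1) - STD_NORMAL.cdf(lo - mu1))


mu0 = 0.5
z_two_sided = STD_NORMAL.inv_cdf(1.0 - ALPHA / 2.0)
z_one_sided = STD_NORMAL.inv_cdf(1.0 - ALPHA)

central = (mu0 - z_two_sided, mu0 + z_two_sided)
shifted = (float("-inf"), mu0 + z_one_sided)  # same integral probability 1 - ALPHA

for name, interval in (("central", central), ("shifted", shifted)):
    larger = power(interval, 1.5)   # alternative mu1 > mu0
    smaller = power(interval, 0.0)  # alternative mu1 < mu0
    print(name, round(larger, 3), round(smaller, 3))
```

Both intervals have size $\alpha$, but the shifted one gains power against the larger alternative and its power against the smaller alternative drops below $\alpha$, i.e. it is biased in the sense of Eq. (2.1).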

Near the boundary $\mu \geq 0$, where the various methods produce different results, it is
clearly much more important to test the alternatives $\mu_1 > \mu_0$ than the alternatives $\mu_1 < \mu_0$,
which are limited. In other words, the boundary introduces an asymmetry in the importance
of the $\mu_1 > \mu_0$ and $\mu_1 < \mu_0$ alternatives. Moreover, experiments are made to search for a
signal and when small signals are searched for (small $\mu$ near the boundary $\mu \geq 0$), testing
the alternatives $\mu_1 > \mu_0$ is physically more meaningful than testing the alternatives $\mu_1 < \mu_0$.
Having more power in the test of the alternatives $\mu_1 > \mu_0$ means that values of $\mu$ smaller
than the true one are less likely to be accepted, i.e. the experiment is more sensitive to a
possible signal.

In general the loss of power in testing $\mu_0$ against $\mu_1 < \mu_0$ when $\mu_0$ is near the physical
boundary leads to less stringent upper limits. If the experiment is not sensitive to values of
$\mu$ near the physical boundary, it does not make any sense to test $\mu_0$ against $\mu_1 < \mu_0$ near
the boundary. Hence, a loss of power for this test is actually desirable and leads to more
reliable upper limits. Indeed, the Central Intervals method, which is unbiased and has a
large power in testing $\mu_0$ against $\mu_1 < \mu_0$ near the boundary, gives practically useless upper
limits.

These considerations imply that a Frequentist method produces upper limits which are
physically significant if near the physical boundary it has a large power in testing $\mu_0$ against
$\mu_1 > \mu_0$ and small power in testing $\mu_0$ against $\mu_1 < \mu_0$.



^2 In general one can consider any boundary for $\mu$. We consider only the case $\mu \geq 0$ for simplicity.

Among the recently proposed Frequentist methods with correct coverage (Unified
Approach [?], Bayesian Ordering [?], Maximum Likelihood Estimator [?]), the shift of the
acceptance intervals towards lower values of the estimator is smallest in the Unified Approach.
We will show this fact explicitly in Section IV for a bounded Gaussian distribution,
but it is true also for a Poisson distribution with known background. Therefore, the
Unified Approach is the least powerful among the new methods in testing $\mu$ against the alternatives
$\mu_1 > \mu$, which are more important near the boundary than the alternatives $\mu_1 < \mu$, for
which the Unified Approach has the highest power. In other words, the Unified Approach is
less sensitive than the other methods to positive signals, because small values of $\mu$ are more
likely to be accepted if the true value of $\mu$ is large. It also produces upper limits that are
too stringent and unreliable from the physical point of view [?], because it has too
much power in testing $\mu$ against the alternatives $\mu_1 < \mu$ near the boundary.



IV. GAUSSIAN DISTRIBUTION WITH BOUNDARY 

In order to illustrate the power of different methods, let us consider an observable $x$ with
Gaussian distribution around a non-negative mean $\mu$ and a standard deviation $\sigma$, assumed
to be known. In this case $x$ is the estimator of $\mu$ ($\hat{\mu} = x$) and a measurement of $x$ gives a
confidence interval for $\mu$, which depends on the chosen method.

Figure 1 shows the 90% CL confidence belts ($\alpha = 0.10$) for $\sigma = 1$ corresponding to four
different methods: the standard Central Intervals method and the three new methods with
correct coverage, Unified Approach [?], Bayesian Ordering [?]^3 and Maximum Likelihood
Estimator [?]. For $x \gg \sigma$, far from the boundary $\mu \geq 0$, all the methods produce the same
results. The Central Intervals method gives an empty confidence interval for $x < -1.64$.
The other three methods give non-empty confidence intervals for any value of $x$, which
become upper limits for $x < 1.28$. For negative values of $x$ the Unified Approach gives the
most stringent upper limits, whereas the Maximum Likelihood Estimator gives the upper
limit $\mu < 1.65$ for any value of $x < 0$. At $\mu = 1.65$, the confidence belt obtained with
the Maximum Likelihood Estimator method has a discontinuous derivative of the left edge at
$x = 0$, and a discontinuity of the right edge from $x = 3.30$ to $x = 2.92$. The confidence
belt obtained with the Unified Approach has discontinuous derivatives of both left and right
edges at $\mu = 1.65$, where the left edge has $x = 0$ and the confidence belt starts to deviate
from the Central Intervals confidence belt for decreasing $x$. Both left and right edges of the
confidence belt obtained with the Bayesian Ordering are smooth for all values of $x$.
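The leftward shift of the acceptance intervals near the boundary can be explored numerically. The sketch below assumes that the Unified Approach orders the values of $x$ by the likelihood ratio of Feldman and Cousins, $R(x) = L(x|\mu)/L(x|\mu_{\rm best})$ with $\mu_{\rm best} = \max(0, x)$, and approximates the acceptance interval on a grid. The function names, the grid step and the grid half-width are hypothetical choices, and the result is only accurate to about one grid step.

```python
import math
from statistics import NormalDist


def likelihood_ratio(x, mu):
    """R(x) = L(x | mu) / L(x | mu_best), with mu_best = max(0, x) and sigma = 1."""
    mu_best = max(0.0, x)
    return math.exp(-0.5 * (x - mu) ** 2 + 0.5 * (x - mu_best) ** 2)


def lr_acceptance(mu, alpha=0.10, half_width=8.0, step=0.002):
    """Grid approximation of the acceptance interval with likelihood-ratio ordering."""
    dist = NormalDist(mu, 1.0)
    n = int(round(2.0 * half_width / step))
    xs = [mu - half_width + i * step for i in range(n + 1)]
    # Add grid points in order of decreasing R until the probability reaches 1 - alpha.
    ranked = sorted(xs, key=lambda x: likelihood_ratio(x, mu), reverse=True)
    total, chosen = 0.0, []
    for x in ranked:
        chosen.append(x)
        total += dist.pdf(x) * step
        if total >= 1.0 - alpha:
            break
    return min(chosen), max(chosen)


print(lr_acceptance(4.0))  # far from the boundary: close to the central interval
print(lr_acceptance(0.5))  # near the boundary: both edges move to lower x
```

Far from the boundary the interval coincides with the central one, while for small $\mu$ both edges lie below the central edges $\mu \mp 1.645$, in agreement with the behaviour described for Fig. 1.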

For small values of $\mu$ the acceptance intervals in the Unified Approach, Bayesian Ordering
and Maximum Likelihood Estimator are increasingly shifted to the left with respect to those
in the Central Intervals method. Hence, among these methods, the Maximum Likelihood
Estimator has the highest power in testing small values of $\mu$ against larger alternatives, followed
by the Bayesian Ordering and then by the Unified Approach. The order of the power of the
four methods in testing small values of $\mu$ against smaller alternatives is reversed.

^3 In Ref. [?] the Bayesian Ordering method is discussed explicitly only for the case of a Poisson
distribution with known background, but, as noted there, the method can be generalized in a
straightforward way to other cases, such as that considered here.

In order to give a quantitative illustration of the power of the four methods under
consideration, let us define the positive average power function

$$\langle \pi_\mu \rangle_+ = \frac{1}{\Delta\mu} \int_{\mu}^{\mu + \Delta\mu} \pi_\mu(\mu_1) \, d\mu_1 , \qquad (4.1)$$

for an arbitrary $\Delta\mu$, and the negative average power function

$$\langle \pi_\mu \rangle_- = \frac{1}{\mu} \int_{0}^{\mu} \pi_\mu(\mu_1) \, d\mu_1 . \qquad (4.2)$$

These two functions, for a Gaussian distribution with mean $\mu \geq 0$, standard deviation $\sigma = 1$,
$\alpha = 0.10$ and $\Delta\mu = 1$, are plotted in Fig. 2.
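The two average power functions can be evaluated numerically. The sketch below does so for the Central Intervals method only, assuming that Eq. (4.1) averages the power over $\mu < \mu_1 < \mu + \Delta\mu$ and Eq. (4.2) over $0 < \mu_1 < \mu$; the midpoint-rule integration and all function names are choices made for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()
ALPHA = 0.10
Z = STD_NORMAL.inv_cdf(1.0 - ALPHA / 2.0)


def power_central(mu, mu1):
    """Power of the central test of mu against mu1 (Gaussian estimator, sigma = 1)."""
    lo, hi = mu - Z, mu + Z
    return 1.0 - (STD_NORMAL.cdf(hi - mu1) - STD_NORMAL.cdf(lo - mu1))


def average(f, a, b, n=2000):
    """Midpoint-rule average of f over the interval [a, b]."""
    h = (b - a) / n
    return sum(f(a + (k + 0.5) * h) for k in range(n)) / n


def positive_average_power(mu, delta_mu=1.0):
    """Average power against larger alternatives, mu < mu1 < mu + delta_mu."""
    return average(lambda mu1: power_central(mu, mu1), mu, mu + delta_mu)


def negative_average_power(mu):
    """Average power against smaller alternatives, 0 < mu1 < mu (requires mu > 0)."""
    return average(lambda mu1: power_central(mu, mu1), 0.0, mu)


for mu in (0.5, 1.0, 2.0, 3.0):
    print(mu, positive_average_power(mu), negative_average_power(mu))
```

For central intervals the positive average power does not depend on $\mu$, while the negative one starts at $\alpha$ for $\mu \to 0$ and grows with the width of the integration range.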

From Fig. 2A one can see that the Maximum Likelihood Estimator method has the
highest power with respect to the alternatives $\mu_1 > \mu$ if $\mu < 1.65$. The power of the
Bayesian Ordering method is higher than that of the Unified Approach and both tend to
the constant power of the Central Intervals method for large values of $\mu$. All methods are
unbiased in testing larger alternatives ($\langle \pi_\mu \rangle_+ > \alpha = 0.1$).

Figure 2B shows that the order of the power of the four methods is reversed when smaller
alternatives are considered and only the Central Intervals method is unbiased. The curves in
Fig. 2B increase with increasing $\mu$ because the range $0 < \mu_1 < \mu$ increases with $\mu$ and values
of $\mu_1$ far from $\mu$ lead to higher values of the power. The value of $\langle \pi_\mu \rangle_-$ tends to $\alpha = 0.10$
for $\mu \to 0$ in all methods, because by definition $\pi_{\mu_0}(\mu_1) \to \alpha$ for $\mu_1 \to \mu_0$ and the interval of
integration in Eq. (4.2) shrinks to zero for $\mu \to 0$. As noted in Section III, a small power in
testing smaller alternatives is desirable in order to obtain physically reliable upper limits.



Let us emphasize that the arbitrary definition of the quantities in Eqs. (4.1) and (4.2)
is irrelevant for the quality of our conclusions. Choosing other appropriate quantities one
always obtains the same classification (Central Intervals, Unified Approach, Bayesian
Ordering, Maximum Likelihood Estimator) of the four considered methods in order of increasing
power to test larger alternatives and decreasing power to test smaller alternatives.



V. POISSON DISTRIBUTION WITH BACKGROUND 

In this section we discuss as another example the case of a Poisson distribution of counts
$n$ with mean signal $\mu$ and known background $b = 5.2$. We consider the same four methods
already considered in the previous section: Central Intervals, Unified Approach [?], Bayesian
Ordering [?] and Maximum Likelihood Estimator [?]. Since the discreteness of $n$ does not
allow one to construct exact Central Intervals, we follow the prescription described in Ref. [?].

Figure 3 shows the 90% CL confidence belts in the four methods. One can see that
for $n < 6$ the Maximum Likelihood Estimator gives a constant upper limit $\mu < 4.8$, the
Bayesian Ordering and Unified Approach methods give upper limits that decrease with $n$,
and the Central Intervals method gives an empty confidence interval for $n < 2$.

Because of the discreteness of $n$, the acceptance intervals do not have exact integral
probability $1 - \alpha$. Their integral probability $1 - \alpha_\mu$ is shown in Fig. 4 as a function of $\mu$ in
the four considered methods. One can see that the dependence of $1 - \alpha_\mu$ on $\mu$ is very wild.
As a consequence, the average power functions in Eqs. (4.1) and (4.2) are also wild functions
of $\mu$. In order to obtain smoother functions of $\mu$, whose behaviour is not too difficult to
interpret, we consider the ratios $\langle \pi_\mu \rangle_+ / \alpha_\mu$ and $\langle \pi_\mu \rangle_- / \alpha_\mu$. These ratios are appropriate
to reveal the presence of a bias, which manifests itself when a ratio becomes less than one.
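The irregular dependence of $1 - \alpha_\mu$ on $\mu$ can be reproduced with a few lines of code. The sketch below uses a conservative equal-tail prescription (each excluded tail has probability at most $\alpha/2$), which is an assumption made for illustration and is not necessarily the prescription followed in the text; the function names are also hypothetical.

```python
import math


def poisson_pmf(n, mean):
    """Poisson probability of n counts for a given mean > 0."""
    return math.exp(n * math.log(mean) - mean - math.lgamma(n + 1))


def central_acceptance(mu, b=5.2, alpha=0.10):
    """Return (n_lo, n_hi, integral probability) with each excluded tail <= alpha / 2."""
    mean = mu + b
    below = 0.0  # P(n < n_lo)
    n_lo = 0
    while below + poisson_pmf(n_lo, mean) <= alpha / 2.0:
        below += poisson_pmf(n_lo, mean)
        n_lo += 1
    n_hi = n_lo
    cdf = below + poisson_pmf(n_hi, mean)  # P(n <= n_hi)
    while 1.0 - cdf > alpha / 2.0:
        n_hi += 1
        cdf += poisson_pmf(n_hi, mean)
    return n_lo, n_hi, cdf - below


for k in range(0, 11):
    mu = 0.5 * k
    n_lo, n_hi, prob = central_acceptance(mu)
    print(f"mu = {mu:3.1f}  acceptance = [{n_lo}, {n_hi}]  1 - alpha_mu = {prob:.4f}")
```

The integral probability never falls below $1 - \alpha$ with this prescription, and it jumps each time one of the integer edges of the acceptance interval moves.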

Figure 5A shows the ratio $\langle \pi_\mu \rangle_+ / \alpha_\mu$, with $\Delta\mu = 1$ (see Eq. (4.1)). In spite of the
precaution to divide $\langle \pi_\mu \rangle_+$ by $\alpha_\mu$, the curves still have wild jumps because of the discreteness
of $n$. From Fig. 5A one can see that for most small values of $\mu$ the Maximum Likelihood
Estimator has the highest power in testing $\mu$ against larger alternatives, followed in order
by Bayesian Ordering, Unified Approach and Central Intervals. The Unified Approach has
higher (or equal) power than Central Intervals for $\mu < 2.6$, and smaller (or equal) power
for higher values of $\mu$.

Figure 5B shows the ratio $\langle \pi_\mu \rangle_- / \alpha_\mu$ (see Eq. (4.2)). One can see that even the Central
Intervals method is slightly biased. This is due to the fact that, because of the discreteness
of $n$, it is not possible to construct exactly central acceptance intervals. Fig. 5B shows that
Central Intervals and Unified Approach have the highest power in testing $\mu$ against smaller
alternatives (Central Intervals for $\mu < 2.2$ and Unified Approach for higher values of $\mu$).
The Maximum Likelihood Estimator method has the smallest power for $\mu < 4.8$.

In conclusion of this section, we have shown with a specific example that also in the
case of a Poisson process with background, among the recently proposed methods with
correct coverage, the Maximum Likelihood Estimator yields confidence intervals with the highest
physical significance (with the criteria discussed in Section III), followed in order by Bayesian
Ordering and Unified Approach.



VI. CONCLUSIONS 

In conclusion, we have considered the power of Frequentist methods, which quantifies 
their capability to avoid Type II errors. We have connected the power with the physical 
significance of confidence intervals, i.e. their degree of reliability.

Considering the case of a parameter bounded to be non-negative, we have shown that 
near the boundary a (biased) method that has large power in testing the parameter against 
larger alternatives and small power in testing the parameter against smaller alternatives 
produces confidence intervals (upper limits) that are physically more significant. 

We have shown that among the recently proposed Frequentist methods with correct
coverage (Unified Approach [?], Bayesian Ordering [?], Maximum Likelihood Estimator [?]),
the widely used Unified Approach yields upper limits with the smallest physical significance.
The upper limits with the highest physical significance are produced by the Maximum 
Likelihood Estimator method. 

We have illustrated our arguments in the cases of a bounded Gaussian distribution
(Section IV) and a Poisson distribution with known background (Section V).

FIG. 1. 90% CL confidence belts ($\alpha = 0.10$) in four methods for an observable $x$ with Gaussian
distribution around a positive mean $\mu$ and a standard deviation $\sigma = 1$. Solid lines: Central
Intervals; Dashed lines: Unified Approach [?]; Dotted lines: Bayesian Ordering [?]; Dash-dotted
lines: Maximum Likelihood Estimator [?].

FIG. 2. Averaged power functions of the 90% CL confidence belts ($\alpha = 0.10$) in four methods
for a Gaussian distribution around a positive mean $\mu$ and a standard deviation $\sigma = 1$. (A): positive
average power function $\langle \pi_\mu \rangle_+$ with $\Delta\mu = 1$ (see Eq. (4.1)); (B): negative average power function
$\langle \pi_\mu \rangle_-$ (see Eq. (4.2)). Solid line: Central Intervals; Dashed line: Unified Approach [?]; Dotted line:
Bayesian Ordering [?]; Dash-dotted line: Maximum Likelihood Estimator [?].

FIG. 3. 90% CL confidence belts ($\alpha = 0.10$) in four methods for a Poisson distribution of
counts $n$ with mean signal $\mu$ and known background $b = 5.2$. Solid lines: Central Intervals; Dashed
lines: Unified Approach [?]; Dotted lines: Bayesian Ordering [?]; Dash-dotted lines: Maximum
Likelihood Estimator [?].

FIG. 4. Integral probability $1 - \alpha_\mu$ of the acceptance intervals of the 90% CL confidence belts
($\alpha = 0.10$) in four methods for a Poisson distribution of counts $n$ with mean signal $\mu$ and known
background $b = 5.2$. Solid line: Central Intervals; Dashed line: Unified Approach [?]; Dotted line:
Bayesian Ordering [?]; Dash-dotted line: Maximum Likelihood Estimator [?].

FIG. 5. Normalized average power functions of the 90% CL confidence belts ($\alpha = 0.10$) in
four methods for a Poisson distribution of counts $n$ with mean signal $\mu$ and known background
$b = 5.2$. (A): normalized positive average power function $\langle \pi_\mu \rangle_+ / \alpha_\mu$ with $\Delta\mu = 1$ (see Eq. (4.1));
(B): normalized negative average power function $\langle \pi_\mu \rangle_- / \alpha_\mu$ (see Eq. (4.2)). Solid line: Central
Intervals; Dashed line: Unified Approach [?]; Dotted line: Bayesian Ordering [?]; Dash-dotted line:
Maximum Likelihood Estimator [?].